Topics

  • Use simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.

  • Estimate the slope and intercept of the regression line using the least squares method.

  • Interpret the slope and intercept of the regression line.

1. Motivation & Library

The library() function is used to access functionality that is provided by R packages, but is not included in base R. install.packages() can be used to install new packages. Run this command from the console.

# install.packages("tidyverse")

First, load the package tidyverse that will be used throughout the tutorial for data visualizations.

library(tidyverse)
load(file='nhanes1518.rda')

The functions head() and names() can be used to explore the data. head() can output the first several rows of the data, while `names() can provide all the names of variables.

head(nhanes1518)
## # A tibble: 6 × 52
##    SEQN WTINT2YR WTMEC…¹ RIDST…² RIAGE…³ RIDAG…⁴ INDFM…⁵ RIDRE…⁶ DMDED…⁷ DMDED…⁸
##   <int>    <dbl>   <dbl>   <int>   <int>   <int>   <dbl>   <int>   <int>   <int>
## 1 83732  134671. 135630.       2       1      62    4.39       3       5      NA
## 2 83733   24329.  25282.       2       1      53    1.32       3       3      NA
## 3 83734   12400.  12576.       2       1      78    1.51       3       3      NA
## 4 83735  102718. 102079.       2       2      56    5          3       5      NA
## 5 83736   17628.  18235.       2       2      42    1.23       4       4      NA
## 6 83737   11252.  10879.       2       2      72    2.82       1       2      NA
## # … with 42 more variables: INDHHIN2 <int>, BMXBMI <dbl>, BMXWAIST <dbl>,
## #   BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>, URXCNP <dbl>,
## #   URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>, URDECPLC <int>,
## #   URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>, URDMBPLC <int>, URXMC1 <dbl>,
## #   URDMC1LC <int>, URXMCOH <dbl>, URDMCOLC <int>, URXMEP <dbl>,
## #   URDMEPLC <int>, URXMHBP <dbl>, URDMHBLC <int>, URXMHH <dbl>,
## #   URDMHHLC <int>, URXMHNC <dbl>, URDMCHLC <int>, URXMHP <dbl>, …
names(nhanes1518)
##  [1] "SEQN"     "WTINT2YR" "WTMEC2YR" "RIDSTATR" "RIAGENDR" "RIDAGEYR"
##  [7] "INDFMPIR" "RIDRETH1" "DMDEDUC2" "DMDEDUC3" "INDHHIN2" "BMXBMI"  
## [13] "BMXWAIST" "BMXWT"    "BMXHT"    "URXUCR"   "WTSB2YR"  "URXCNP"  
## [19] "URDCNPLC" "URXCOP"   "URDCOPLC" "URXECP"   "URDECPLC" "URXHIBP" 
## [25] "URDHIBLC" "URXMBP"   "URDMBPLC" "URXMC1"   "URDMC1LC" "URXMCOH" 
## [31] "URDMCOLC" "URXMEP"   "URDMEPLC" "URXMHBP"  "URDMHBLC" "URXMHH"  
## [37] "URDMHHLC" "URXMHNC"  "URDMCHLC" "URXMHP"   "URDMHPLC" "URXMIB"  
## [43] "URDMIBLC" "URXMNP"   "URDMNPLC" "URXMOH"   "URDMOHLC" "URXMZP"  
## [49] "URDMZPLC" "WTINT4YR" "WTMEC4YR" "WTSB4YR"

In this chapter, we introduce linear regression which aims to model the relationship between a response variable and one or more predictor variables by fitting a linear equation to observed data. The goal of linear regression is to find the best-fitting line (or hyperplane) that minimizes the sum of the squared differences between the observed responses and the values predicted by the equation. The resulting equation can then be used to make predictions about the response for new inputs. In essence, linear regression aims to answer the question of how changes in the independent variables relate to changes in the dependent variable.

We will illustrate investigating the relationship between age and BMI as an example of linear regression.

2. Exploratory Data Analysis

We are interested in the relation between age and BMI, and are especially interested in adults. Before applying data analysis, we want to explore the distribution of these variables:

nhanes1518 <- nhanes1518%>%filter(RIDAGEYR>=18)

This tutorial will be using the nhanes dataset where the variables are described in the file nhanes-codebook.txt. Load this data with the load function and specify the rda data file.

# Basic histogram
ggplot(nhanes1518, aes(x=RIDAGEYR)) + geom_histogram()+labs(x = "waist circumference", title = "Distribution of Age")

# Change the width of bins
ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
  geom_histogram(binwidth=1)+labs(x = "Age", title = "Distribution of Age")

# Change colors
p<-ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
  geom_histogram(color="black", fill="white")+labs(x = "Age", title = "Distribution of Age")

We are interested in the relation between age and BMI. Before applying data analysis, we want to check the distributions of these variables.

We see that the distribution of waist circumference is asymmetric, with peaks at age of around 80 and 0 respectively.

We can also add the mean line, as well as overlay with transparent density plot. The value of alpha controls the level of transparency:

# Add mean line
p+geom_vline(aes(xintercept=mean(RIDAGEYR)),
            color="blue", linetype="dashed", size=1)

# Histogram with density plot
ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") # add a layer of density

3. Model Assumptions

The assumptions of linear regression model are as follows:

  • Linearity: The relationship between the independent and dependent variables is linear.
  • Independence: The observations are independent of each other, meaning that the value of the dependent variable for one observation is not influenced by the values of the independent variables for other observations.
  • Homoscedasticity: The variance of the errors is constant for all values of the independent variables.
  • Normality: The errors are normally distributed.
  • No multicollinearity: The independent variables are not highly correlated with each other.
  • No Autocorrelation: The errors are not correlated with each other. Violations of these assumptions can affect the accuracy and interpretability of the regression results. It is important to check for these assumptions before applying linear regression and consider alternative models if necessary.

We will verify that our data meets all the assumptions above in Section 5. Model Diagnostics.

4. Linear Regression

Simple Linear Regression

Simple linear regressions take the form

\[Y_i = \beta_0 +\beta_1 X_i +\epsilon_i\] Where \(Y_i\) is the dependent variable, \(X_i\) is the independent variable and \(\epsilon_i\) is the random error term.

  • \(\beta_1\): True slope of the relationship between X and Y
  • \(\beta_0\): True intercept of the relationship between X and Y
  • \(\epsilon\): Error (residual)

We’ll start with a fitting a simple linear model using the lm() function. In the lm() function, the first variable is the response variable and the variables to the right of the ~ symbol are the predictor variable(s). Here we use age as the response, and BMI as the predictor variables.

lm.fit <- lm(RIDAGEYR ~ BMXBMI, data = nhanes1518)

There are several ways that we can examine the model results. The summary() function gives a more extensive overview of the model fit:

summary(lm.fit)
## 
## Call:
## lm(formula = RIDAGEYR ~ BMXBMI, data = nhanes1518)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.786 -16.164   0.086  14.837  32.948 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.41461    0.73402  61.871  < 2e-16 ***
## BMXBMI       0.11531    0.02413   4.778 1.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.48 on 11094 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:  0.002053,   Adjusted R-squared:  0.001963 
## F-statistic: 22.83 on 1 and 11094 DF,  p-value: 1.795e-06

Model Interpretation

  • \(\beta_0\): Not interpretable in our case. For a person with 0 BMI (which is unrealistic), his or her expected age is -2.241 (unrealistic again). \(\beta_1\): For every unit increase in the BMI of a person, his or her age is expected to increase by 1.432 on average.

  • p value: The p value tells us how likely the data we have observed is to have occurred under the null hypothesis (more material on Null hypothesis on subsequent tutorials), i.e. that there is no correlation between the predictor variable age and the response BMI. From the model above, we have a p value of less than 2e-16, which tells us that the predictor variable age is statistically significant.

The coefficients of the linear regression model can be extracted using the coef() function and the confidence interval(s) with the confint() function.

coef(lm.fit)
## (Intercept)      BMXBMI 
##  45.4146068   0.1153059
confint(lm.fit)
##                   2.5 %     97.5 %
## (Intercept) 43.97580422 46.8534094
## BMXBMI       0.06799999  0.1626119

Model Prediction

We can use the predict() function to obtain prediction intervals or confidence intervals for a given value of the predictor variable, BMXWAIST. Note that when using the predict function, the column names and format of the new points at which to predict needs to be the same as the original data frame used to fit the lm() model. If you encounter errors using the predict() function, this is a good first thing to check.

predict(lm.fit, data.frame(BMXBMI = (c(70, 80, 90))), interval = "confidence")
##        fit      lwr      upr
## 1 53.48602 51.54110 55.43094
## 2 54.63908 52.22710 57.05106
## 3 55.79214 52.91114 58.67314
predict(lm.fit, data.frame(BMXBMI = (c(70, 80, 90))), interval = "prediction")
##        fit      lwr      upr
## 1 53.48602 17.21825 89.75379
## 2 54.63908 18.34327 90.93489
## 3 55.79214 19.46215 92.12213

Prediction Interval vs Confidence Interval

Prediction and confidence interval are both statistical concepts that are used to estimate or quantify uncertainty in a particular outcome or parameter. However, they have different meanings and interpretations.

  • Confidence interval: a range of values that is likely to contain the true value of a parameter with a certain degree of confidence. For example, if you are estimating the mean age of a population based on a sample, a 95% confidence interval would provide a range of values within which the true population mean age is expected to lie with 95% confidence. The width of the confidence interval reflects the uncertainty in the estimation, with wider intervals indicating more uncertainty.

  • Prediction: An estimate of a specific value or outcome based on the statistical model. For example, in our case when we use a linear regression model to predict the age of a person based on their BMI, a prediction would be the estimated age for a person of a specific BMI Predictions are usually accompanied by a measure of how accurate the estimate is, such as the mean squared error or the root mean squared error. Unlike a confidence interval which provides a range of values for the parameter, a prediction interval provides a range of values for the actual observation or outcome.

We can plot the variables BMXBMI and BMXWAIST using the plot() function and overlay the regression line found using lm() with the abline() function.

plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Age",ylab="Body Mass Index (kg/m**2)")
abline(lm.fit)

Here the dollar sign specifies that we retrieve the predictor columns BMXBMI and BMXWAIST from the dataset nhanes1518

We can experiment with different options for abline() by changing the line width and color in abline().

plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)")
abline(lm.fit, lwd = 3, col = "red")

plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)", col = "red")

The pch argument in plot() changes the shape/type of the points that are plotted.

plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)",pch = 20)

plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)",pch = "+")

Optional: We can make a similar plot using ggplot, where we fit the linear regression model using ggplot().

ggplot(nhanes1518, aes(x = RIDAGEYR, y = BMXBMI)) + 
    geom_smooth(method = "lm", formula = y ~ x, colour = "blue") + 
    geom_point() +
  ggtitle("Age vs. BMI for the nhanes data")

5. Model Diagonostics & Interpretation

The par() function can be used to create a grid of multiple subplots.

par(mfrow = c(2, 2))
plot(lm.fit)

The diagnostic plots show residuals in four different ways:

  • Residuals vs Fitted. Used to check the linear relationship assumptions. A horizontal line, without distinct patterns is an indication for a linear relationship, what is good. The model we fitted shows roughly a linear relationship, with no distinct patterns (such as a fan or funnel shape) in the residuals vs. fitted plot.

  • Normal Q-Q. Used to examine whether the residuals are normally distributed. It’s good if residuals points follow the straight dashed line. The Q-Q plot generally follows the straight dashed line, with some deviations at the end towards high values of theoretical quantiles.

  • Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity.

  • Residuals vs Leverage. Used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis. Based on the residuals vs. leverage plot, there are no influential points according to Cook’s distance. However, there might be some points with high standard residuals values which could be marked as outliers.

Some other metrics from the model:

  • \(R^2\): From the model above, we have an adjusted R-squared value of 0.2302, which indicates that 23.02% of the variability in the response variable BMI can be explained by the change in the predictor variable age.
  • p value: The p value tells us how likely the data we have observed is to have occurred under the null hypothesis (more material on Null hypothesis on subsequent tutorials), i.e. that there is no correlation between the predictor variable age and the response BMI. From the model above, we have a p value of 2.2e-16, which tells us that the predictor variable age is statistically significant.

We can use the residuals() and rstudent() functions to extract the residuals and studentized residuals, respectively, from the linear model and plot them along with the predicted values.

plot(predict(lm.fit), residuals(lm.fit))

plot(predict(lm.fit), rstudent(lm.fit))

Additionally, we can compute the influence matrix for the predictors using the hatvalues() function.

plot(hatvalues(lm.fit))

which.max(hatvalues(lm.fit))
## 8723 
## 8196

6. Multiple Linear Regression

The lm() function can also fit multiple regression models. In this section, we will use RIDAGEYR, BMXWAIST, BMXWT, and BMXHT as predictors of the response variable BMXBMI.

Multiple linear regression allows to evaluate the relationship between two variables, while controlling for the effect (i.e., removing the effect) of other variables.

lm.fit <- lm(BMXBMI ~ RIDAGEYR+ BMXWAIST + BMXWT + BMXHT, data = nhanes1518)
summary(lm.fit)
## 
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR + BMXWAIST + BMXWT + BMXHT, data = nhanes1518)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1176 -0.3262  0.0503  0.3471  8.7914 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.4364793  0.2080287   276.1  < 2e-16 ***
## RIDAGEYR    -0.0039875  0.0005317    -7.5 6.88e-14 ***
## BMXWAIST     0.0166524  0.0015277    10.9  < 2e-16 ***
## BMXWT        0.3443669  0.0012653   272.2  < 2e-16 ***
## BMXHT       -0.3463345  0.0011206  -309.1  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8401 on 10529 degrees of freedom
##   (1314 observations deleted due to missingness)
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.986 
## F-statistic: 1.849e+05 on 4 and 10529 DF,  p-value: < 2.2e-16

Model Interpretation:

  • Intercept: The intercept does not have interpretability since it is unrealistic to have age 0, body waist circumference of 0, height and weight of 0.
  • BMXWT: The coeffcient for the predictor BMXWT is 0.2420969, which means that for every unit increase in the participant’s waist circumference, the BMI is expected to increase by 0.2420969 on average, holding all else constant (holding all other predictor variables, BMXWAIST, BMXWT, and BMXHT constant)
# we can use select to filter the variables of interest
nhanes_core<-nhanes1518%>%select(BMXBMI, RIDAGEYR, BMXWAIST,BMXWT,BMXHT)
# In the lm() formula, a dot . can be used to include all variables in the NHANES data as predictors.
lm.fit1 <- lm(nhanes1518$BMXBMI ~ ., data = nhanes_core)
summary(lm.fit1)
## 
## Call:
## lm(formula = nhanes1518$BMXBMI ~ ., data = nhanes_core)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1176 -0.3262  0.0503  0.3471  8.7914 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.4364793  0.2080287   276.1  < 2e-16 ***
## RIDAGEYR    -0.0039875  0.0005317    -7.5 6.88e-14 ***
## BMXWAIST     0.0166524  0.0015277    10.9  < 2e-16 ***
## BMXWT        0.3443669  0.0012653   272.2  < 2e-16 ***
## BMXHT       -0.3463345  0.0011206  -309.1  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8401 on 10529 degrees of freedom
##   (1314 observations deleted due to missingness)
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.986 
## F-statistic: 1.849e+05 on 4 and 10529 DF,  p-value: < 2.2e-16
# If we want to exclude specific variables from the list of predictors, we can use the `-` notation. In the following example, all predictor variables but `age` are included in the model.
# Including `-1` excludes the intercept from the model.

lm.fit1 <- lm(nhanes1518$BMXBMI ~ .- 1, data = nhanes_core)
summary(lm.fit1)
## 
## Call:
## lm(formula = nhanes1518$BMXBMI ~ . - 1, data = nhanes_core)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0674 -1.4507  0.0126  1.5069 20.8889 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## RIDAGEYR -0.0343057  0.0014932  -22.98   <2e-16 ***
## BMXWAIST  0.3174566  0.0030740  103.27   <2e-16 ***
## BMXWT     0.0900488  0.0024898   36.17   <2e-16 ***
## BMXHT    -0.0483656  0.0008657  -55.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.412 on 10530 degrees of freedom
##   (1314 observations deleted due to missingness)
## Multiple R-squared:  0.9937, Adjusted R-squared:  0.9937 
## F-statistic: 4.127e+05 on 4 and 10530 DF,  p-value: < 2.2e-16

Multicollinearity Diagnostics

Apart from diagnostics for simple linear regression, we also need to perform multicollinearity checks for multiple linear regression:

library(car)
# Calculate VIFs
vifs <- vif(lm.fit)

# Print VIFs
print(vifs)
##  RIDAGEYR  BMXWAIST     BMXWT     BMXHT 
##  1.420864 10.107175 11.503433  1.893344

Variance Inflation Factor (VIF): The VIF is a measure of the increase in the variance of the estimated coefficients due to multicollinearity. A VIF value greater than 5 indicates that there is strong multicollinearity in the model. Here the multicollinearity issue is not significant.

7. Categorical Variable

nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI, nhanes1518$INDHHIN2)
nhanes_income <- dplyr::rename(nhanes_na_removed,income = "nhanes1518$INDHHIN2")%>% na.omit()
nhanes_income <- dplyr::rename(nhanes_income,BMI = "nhanes1518$BMXBMI")
# turn quantitative variable into categorical variable
nhanes_income$income<-as.character(nhanes_income$income)

Now we explore the effect of income on the response BMI. Income is stored as 1, 2,.., 15, 77, 99, a categorical predictor variable, in the dataset. The encoding of income categories can be found in this website:

https://wwwn.cdc.gov/nchs/nhanes/2011-2012/demo_g.htm#INDHHIN2

We want to first drop categories with values 77 (Refused) and 99 (Don’t Know) first:

nhanes_income <- subset(nhanes_income, income!="77" & income!="99")

Then we fit the linear regression on categorical variable income:

summary(lm(BMI ~ income, data = nhanes_income))
## 
## Call:
## lm(formula = BMI ~ income, data = nhanes_income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.053  -5.097  -1.153   3.796  55.708 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.08676    0.42827  67.917  < 2e-16 ***
## income10     0.25666    0.53138   0.483  0.62910    
## income12     0.07855    0.56365   0.139  0.88917    
## income13     1.15974    0.72021   1.610  0.10737    
## income14     0.35724    0.48585   0.735  0.46218    
## income15    -0.60605    0.46019  -1.317  0.18788    
## income2      1.16654    0.57274   2.037  0.04170 *  
## income3      0.65874    0.52391   1.257  0.20866    
## income4      0.39053    0.51242   0.762  0.44599    
## income5      0.61757    0.51184   1.207  0.22763    
## income6      0.91220    0.47882   1.905  0.05680 .  
## income7      0.86657    0.48244   1.796  0.07249 .  
## income8      0.96502    0.49816   1.937  0.05275 .  
## income9      1.40552    0.51288   2.740  0.00615 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.255 on 10176 degrees of freedom
## Multiple R-squared:  0.006989,   Adjusted R-squared:  0.00572 
## F-statistic: 5.509 on 13 and 10176 DF,  p-value: 4.42e-10

Baseline: income category 1 corresponding to a household income of 0 to 4,999 dollars.

Model Interpretation:

  • Intercept: The intercept 25.6489 means that for people in the baseline income category (income category 1 corresponding to a household income of 0 to 4,999 dollars), the BMI is expected to be 25.6489 on average.

  • income6: The coeffcient for the predictor income6 is 1.1473, which means that for participants with household income category 6 (25,000 to 34,999 dollars per ear), the BMI is expected to be 1.1473 higher than that of participants with household income in category 1 (0 to 4,999 dollars), on average.

8. Interaction Terms

There are two ways to include interaction terms in the model, : and *. The : symbol only includes the interaction term between the two variables, while the * symbol includes the variables themselves, as well as the interaction terms. This means that BMXWT*BMXWAIST is equivalent to BMXWT + BMXWAIST + BMXWT:BMXWAIST.

summary(lm(BMXBMI ~  BMXWT* BMXWAIST, data = nhanes1518))
## 
## Call:
## lm(formula = BMXBMI ~ BMXWT * BMXWAIST, data = nhanes1518)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6955  -1.7808  -0.2051   1.5276  20.6194 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.379e+00  4.769e-01   2.892  0.00384 ** 
## BMXWT          4.045e-02  6.577e-03   6.150 8.01e-10 ***
## BMXWAIST       1.913e-01  5.180e-03  36.925  < 2e-16 ***
## BMXWT:BMXWAIST 6.650e-04  5.117e-05  12.997  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.71 on 10530 degrees of freedom
##   (1314 observations deleted due to missingness)
## Multiple R-squared:  0.8539, Adjusted R-squared:  0.8538 
## F-statistic: 2.051e+04 on 3 and 10530 DF,  p-value: < 2.2e-16
# Fit the linear regression model
fit <- lm(BMXBMI ~ BMXWT * BMXWAIST, data = nhanes1518)

# Plot the interaction effects
ggplot(nhanes1518, aes(x = BMXWAIST, y = BMXBMI, color = BMXWT)) + 
  geom_point() + 
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE) + 
  scale_color_gradient(low = "blue", high = "red") + 
  labs(x = "BMXWAIST", y = "BMXBMI", color = "BMXWT") + 
  ggtitle("Interaction effects between BMXWT, BMXWAIST, and BMXBMI")

The plot shows the relationship between BMXBMI and BMXWAIST for different levels of BMXWT, where each level is represented by a different color. The plot provides a visual representation of how the effect of BMXWAIST on BMXBMI differs for each level of BMXWT: For high levels of BMXWT, we notice that BMXWAIST has the most significant effect on BMXBMI, with the sharpest slope. Whereas, for low levels of BMXWT, we notice that BMXWAIST has far less significant effect on BMXBMI, with a very small slope at a stand still.

A simple way to include all interaction terms is the syntax .^2.

library(dplyr)
nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI)
# data clearning to ignore NA values
summary(lm(nhanes1518$BMXBMI ~ .^2, data = nhanes_na_removed))
## 
## Call:
## lm(formula = nhanes1518$BMXBMI ~ .^2, data = nhanes_na_removed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.956e-13 -1.400e-16  1.900e-17  1.930e-16  5.786e-15 
## 
## Coefficients: (6 not defined because of singularities)
##                                Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)                  -5.194e-15  2.375e-15 -2.187e+00   0.0288 *  
## SEQN                         -1.453e-20  2.518e-20 -5.770e-01   0.5640    
## WTINT2YR                     -1.138e-19  1.393e-19 -8.170e-01   0.4138    
## WTMEC2YR                      1.221e-19  1.344e-19  9.090e-01   0.3634    
## RIDSTATR                             NA         NA         NA       NA    
## RIAGENDR                     -1.760e-15  9.395e-16 -1.873e+00   0.0611 .  
## `nhanes1518$BMXBMI`           1.000e+00  6.442e-17  1.552e+16   <2e-16 ***
## SEQN:WTINT2YR                 7.582e-25  1.433e-24  5.290e-01   0.5967    
## SEQN:WTMEC2YR                -8.528e-25  1.384e-24 -6.160e-01   0.5377    
## SEQN:RIDSTATR                        NA         NA         NA       NA    
## SEQN:RIAGENDR                 1.502e-20  9.817e-21  1.530e+00   0.1261    
## SEQN:`nhanes1518$BMXBMI`     -3.974e-22  6.755e-22 -5.880e-01   0.5564    
## WTINT2YR:WTMEC2YR            -4.070e-27  7.030e-27 -5.790e-01   0.5626    
## WTINT2YR:RIDSTATR                    NA         NA         NA       NA    
## WTINT2YR:RIAGENDR            -7.333e-21  1.484e-20 -4.940e-01   0.6212    
## WTINT2YR:`nhanes1518$BMXBMI`  7.194e-22  1.182e-21  6.090e-01   0.5429    
## WTMEC2YR:RIDSTATR                    NA         NA         NA       NA    
## WTMEC2YR:RIAGENDR             7.441e-21  1.429e-20  5.210e-01   0.6026    
## WTMEC2YR:`nhanes1518$BMXBMI` -7.133e-22  1.145e-21 -6.230e-01   0.5332    
## RIDSTATR:RIAGENDR                    NA         NA         NA       NA    
## RIDSTATR:`nhanes1518$BMXBMI`         NA         NA         NA       NA    
## RIAGENDR:`nhanes1518$BMXBMI`  2.472e-18  7.725e-18  3.200e-01   0.7490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.843e-15 on 11080 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.834e+33 on 15 and 11080 DF,  p-value: < 2.2e-16